Are two wrong models better than one? Missing data style

rstats

simulations

missing data

Would imputing + fitting an outcome model using the wrong variables be better than just fitting the wrong outcome model? Let’s investigate!

Author

Lucy D’Agostino McGowan

Published

April 29, 2023

After my previous post about missing data, Kathy asked on Twitter whether two wrong models (the imputation model + the outcome model) would be better than one (the outcome model alone).

Without doing any of the math, I’d guess the assumption of correctly spec the model also has a bigger impact in the CC analysis.

You need correct spec in MI, twice, but trade off that potential bias for higher prec.

This is a great question! I am going to investigate via a small simulation (so the answer could be “it depends”, but at least we will know how it seems to work in this very simple case) 😆.

Ok so here I have some predictor, x that is missing 50% of the time, dependent on c_x and c_y. The right imputation model would have c_x, the right outcome model needs c_y. Unfortunately, we only have access to one, which we will try to use in our imputation model (and outcome model). Let’s see whether two (wrong) models are better than one!

A “correct” model will be one that estimates that the coefficient for x is 1.

library(tidyverse)
library(broom)
library(mice)

n <- 1000

set.seed(928)

data <- tibble(
  c_x = rnorm(n, sd = 0.71),
  x = c_x + rnorm(n, sd = 0.71),
  c_y = rnorm(n),
  y = x + c_y + rnorm(n),
  noise = rnorm(n),
  x_miss = rbinom(n, 1, 1 / (1 + exp(-(c_x + c_y)))),
  x_obs = ifelse(
    x_miss,
    NA,
    x
  )
)

We only have `c_x`

Ok first let’s look at the whole dataset.

mod_full_c_x <- lm(y ~ x + c_x, data = data) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)

mod_full_c_x

# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1     1.01    0.877      1.14

This checks out! c_x basically does nothing for us here, but because c_y is not actually a confounder (it just informs the missingness & y, which we aren’t observing here), we are just fine estimating our “wrong” model in the fully observed data. Now let’s do the “complete cases” analysis.

data_cc <- na.omit(data)
mod_cc_c_x <- lm(y ~ x + c_x, data = data_cc) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)

mod_cc_c_x

# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1    0.999    0.812      1.19

This does fine! Now let’s do some imputation. I am going to use the mice package.

imp_data_c_x <- mice(
  data, 
  m = 5, 
  method = "norm.predict",
  formulas = list(x_obs ~ c_x),
  print = FALSE)

Ok let’s compare how this model does “alone”.

mod_imp_c_x <- with(imp_data_c_x, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_c_x

  estimate  conf.low conf.high
1 1.026666 0.9042858  1.149046

Great! This was the right model, so we would expect this to perform well.

Now what happens if we adjust for c_x in addition in the outcome model:

mod_double_c_x <- with(imp_data_c_x, lm(y ~ x_obs + c_x)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_c_x

   estimate  conf.low conf.high
1 0.9991868 0.7968509  1.201523

The right imputation model with the wrong outcome model is fine!

We only have `c_y`

Ok first let’s look at the whole dataset.

mod_full_c_y <- lm(y ~ x + c_y, data = data) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)

mod_full_c_y

# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1     1.01    0.946      1.08

Looks good! Now let’s do the “complete cases” analysis.

mod_cc_c_y <- lm(y ~ x + c_y, data = data_cc) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)

mod_cc_c_y

# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1     1.01    0.909      1.11

Great! It works. This shows that as long as we have the right outcome model we can do complete case analysis even if the data is missing not at random (cool!). Now let’s do some imputation.

imp_data_c_y <- mice(
  data, 
  m = 5, 
  method = "norm.predict",
  formulas = list(x_obs ~ c_y),
  print = FALSE)

Ok let’s compare how this model does “alone”.

mod_imp_c_y <- with(imp_data_c_y,lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_c_y

   estimate conf.low conf.high
1 0.6888255 0.527796  0.849855

Oh no, very bad! The wrong imputation model is worse than complete case! By a lot! This estimate is off by 0.31. Does conditioning on c_y help us at all?

mod_double_c_y <-  with(imp_data_c_y, lm(y ~ x_obs + c_y)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_c_y

  estimate  conf.low conf.high
1 1.009655 0.8887281  1.130582

Phew, the wrong imputation model with the wrong outcome model is back to being fine.

What if both are wrong?

Ok, what if we just had our useless variable, noise.

mod_full_noise <- lm(y ~ x + noise, data = data) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)

mod_full_noise

# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1    0.992    0.898      1.09

This is fine! c_x and c_y aren’t confoudners so we can estimate the coefficent for x without them – noise doesn’t do anything, but it also doesn’t hurt. What about complete case?

mod_cc_noise <- lm(y ~ x + noise, data = data_cc) |>
  tidy(conf.int = TRUE) |>
  filter(term == "x") |>
  select(estimate, conf.low, conf.high)

mod_cc_noise

# A tibble: 1 × 3
  estimate conf.low conf.high
     <dbl>    <dbl>     <dbl>
1    0.887    0.748      1.03

Oops! We’ve got bias (as expected!) – we end up with a biased estimate by ~0.11.

What if we build the (wrong) imputation model?

imp_data_noise <- mice(
  data, 
  m = 5, 
  method = "norm.predict",
  formulas = list(x_obs ~ noise),
  print = FALSE)

Ok let’s compare how this model does “alone”.

mod_imp_noise <-  with(imp_data_noise, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_noise

   estimate  conf.low conf.high
1 0.8807755 0.7217368  1.039814

This is also wrong (womp womp!) What if we try two wrong models?

mod_double_noise <-  with(imp_data_noise,lm(y ~ x_obs + noise)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_noise

   estimate  conf.low conf.high
1 0.8865363 0.7275635  1.045509

Nope 😢. Two wrong models here are not better than one! It’s worse! Womp womp.

Let’s put these all together:

bind_rows(
  mod_full_c_x,
  mod_full_c_y,
  mod_full_noise,
  mod_cc_c_x,
  mod_cc_c_y,
  mod_cc_noise,
  mod_imp_c_x,
  mod_imp_c_y,
  mod_imp_noise,
  mod_double_c_x,
  mod_double_c_y,
  mod_double_noise
) |>
  mutate(
    mod = factor(c("Full data with c_x", 
                   "Full data with c_y", 
                   "Full data with noise",
                   "Complete case with c_x",
                   "Complete case wtih c_y",
                   "Complete case with noise",
                   "Imputation with c_x",
                   "Imputation with c_y",
                   "Imputation with noise",
                   "Two models with c_x",
                   "Two models with c_y",
                   "Two models with noise" ),
                 levels = c("Full data with c_x", 
                            "Complete case with c_x",
                            "Imputation with c_x",
                            "Two models with c_x",
                            "Full data with c_y", 
                            "Complete case wtih c_y",
                            "Imputation with c_y",
                            "Two models with c_y",
                            "Full data with noise",
                            "Complete case with noise",
                            "Imputation with noise",
                            "Two models with noise" )),
    mod = fct_rev(mod),
  ) -> to_plot

ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
  geom_pointrange() + 
  geom_vline(xintercept = 1, lty = 2)

So there you have it, two wrong models are rarely better than one.

Addendum!

In writing this post, I found that I was getting biased results when I was correctly specifying my imputation model when using the {mice} defaults (which is why the code above specifies norm.predict for the method, forcing it to use linear regression, as the data were generated). I didn’t understand why this is happening until some helpful friends on Twitter explained it (thank you Rebecca, Julian, and Mario. I’ll show you what is happening and then I’ll show a quick explanation. Let’s try to redo the imputation models using the defaults:

imp_default_c_x <- mice(
  data, 
  m = 5, 
  formulas = list(x_obs ~ c_x),
  print = FALSE)

mod_imp_c_x <- with(imp_default_c_x, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_c_x

   estimate  conf.low conf.high
1 0.7353276 0.6194273 0.8512279

Bad!

mod_double_c_x <-  with(imp_default_c_x, lm(y ~ x_obs + c_x)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_c_x

   estimate conf.low conf.high
1 0.4602637  0.30023 0.6202974

Even worse!!

imp_default_c_y <- mice(
  data, 
  m = 5, 
  formulas = list(x_obs ~ c_y),
  print = FALSE)

mod_imp_c_y <- with(imp_default_c_y, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_c_y

   estimate  conf.low conf.high
1 0.3405789 0.1126282 0.5685295

YIKES!

mod_double_c_y <-  with(imp_default_c_y, lm(y ~ x_obs + c_y)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_c_y

   estimate  conf.low conf.high
1 0.5128878 0.3216547 0.7041208

Better since we are conditioning on c_y (but still bad!)

imp_default_noise <- mice(
  data, 
  m = 5, 
  formulas = list(x_obs ~ noise),
  print = FALSE)

mod_imp_noise <-  with(imp_default_noise, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_noise

   estimate  conf.low conf.high
1 0.4076967 0.2352155 0.5801779

EEK!

mod_double_noise <-  with(imp_default_noise,lm(y ~ x_obs + noise)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_noise

   estimate  conf.low conf.high
1 0.4124791 0.2497624 0.5751959

Just as bad..

Let’s put those in the original plot:

bind_rows(
  mod_full_c_x,
  mod_full_c_y,
  mod_full_noise,
  mod_cc_c_x,
  mod_cc_c_y,
  mod_cc_noise,
  mod_imp_c_x,
  mod_imp_c_y,
  mod_imp_noise,
  mod_double_c_x,
  mod_double_c_y,
  mod_double_noise
) |>
  mutate(
    mod = factor(c("Full data with c_x", 
                   "Full data with c_y", 
                   "Full data with noise",
                   "Complete case with c_x",
                   "Complete case wtih c_y",
                   "Complete case with noise",
                   "Default Imputation with c_x",
                   "Default Imputation with c_y",
                   "Default Imputation with noise",
                   "Two models with c_x",
                   "Two models with c_y",
                   "Two models with noise" ),
                 levels = c("Full data with c_x", 
                            "Complete case with c_x",
                            "Default Imputation with c_x",
                            "Two models with c_x",
                            "Full data with c_y", 
                            "Complete case wtih c_y",
                            "Default Imputation with c_y",
                            "Two models with c_y",
                            "Full data with noise",
                            "Complete case with noise",
                            "Default Imputation with noise",
                            "Two models with noise" )),
    mod = fct_rev(mod),
  ) -> to_plot

ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
  geom_pointrange() + 
  geom_vline(xintercept = 1, lty = 2)

AHH! This makes me so scared of imputation!!

Rebecca Andridge’s tweet finally helped me see why this is happening. The way the missing data is generated, larger values of c_x have a higher probability of missingness, and for particularly high values of c_x that probability is almost 1.

ggplot(data, aes(x = x, y = y, color = factor(x_miss))) +
  geom_point() + 
  geom_vline(xintercept = 2.31, lty = 2) +
  labs(color = "missing")

Take a look at the plot above. We have no non-missing x values that are greater than 2.3. The way predictive mean matching (the default {mice} method) works is it finds the observation(s) that have the closest predicted value to the observation that is missing a data point and gives you that non-missing data point’s value. So here, we are essentially truncating our distribution at 2.3, since that is the highest value observed. Any value that would have been higher is going to be necessarily too small instead of the right value (this is different from the linear model method used in the first part of this post, which allows you to extrapolate). This is supposed to be a less biased approach, since it doesn’t allow you to extrapolate beyond the bounds of your observed data, but it can actually induce bias when you have pockets of missingness with no observed xs (which I would argue might happen frequently!). Here is an example of one of the imputed datasets, notice nothing is above that 2.3 line!

ggplot(complete(imp_default_c_x), aes(x = x_obs, y = y, color = factor(x_miss))) + 
  geom_point() + 
  scale_x_continuous(limits = c(-2.8, 3.1)) +
  geom_vline(xintercept = 2.31, lty = 2) +
  labs(color = "imputed")

What if we use `y` in the imputation models

Including y in the imputation model is definitely recommended, as was hammered home for me by the wonderful Frank Harrell, but I’m not sure this recommendation has permeated through the field yet (although this paper re-iterating this result just came out yesterday so maybe it is!).

Let’s see how that improves our imputation models:

imp_y_c_x <- mice(
  data, 
  m = 5, 
  formulas = list(x_obs ~ c_x + y),
  print = FALSE)

mod_imp_c_x <- with(imp_y_c_x, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_c_x

  estimate  conf.low conf.high
1 1.034138 0.9109341  1.157343

Beautiful!

mod_double_c_x <-  with(imp_y_c_x, lm(y ~ x_obs + c_x)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_c_x

  estimate  conf.low conf.high
1  1.08074 0.8772936  1.284186

A bit worse, but not bad!

imp_y_c_y <- mice(
  data, 
  m = 5, 
  formulas = list(x_obs ~ c_y + y),
  print = FALSE)

mod_imp_c_y <- with(imp_y_c_y, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_c_y

   estimate  conf.low conf.high
1 0.9298631 0.8137157  1.046011

Not bad!! A bit biased but way better.

mod_double_c_y <-  with(imp_y_c_y, lm(y ~ x_obs + c_y)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_c_y

  estimate  conf.low conf.high
1  1.01812 0.9415964  1.094644

Love it, looks great after conditioning on c_y

imp_y_noise <- mice(
  data, 
  m = 5, 
  formulas = list(x_obs ~ noise + y),
  print = FALSE)

mod_imp_noise <-  with(imp_y_noise, lm(y ~ x_obs)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_imp_noise

   estimate  conf.low conf.high
1 0.9671593 0.8580147  1.076304

Oo lala, even does well when we don’t have the right model (this makes sense because we are using y!)

mod_double_noise <-  with(imp_y_noise, lm(y ~ x_obs + noise)) |>
  pool() |>
  tidy(conf.int = TRUE) |>
  filter(term == "x_obs") |>
  select(estimate, conf.low, conf.high)

mod_double_noise

   estimate  conf.low conf.high
1 0.9697639 0.8612499  1.078278

Let’s put those in the original plot:

bind_rows(
  mod_full_c_x,
  mod_full_c_y,
  mod_full_noise,
  mod_cc_c_x,
  mod_cc_c_y,
  mod_cc_noise,
  mod_imp_c_x,
  mod_imp_c_y,
  mod_imp_noise,
  mod_double_c_x,
  mod_double_c_y,
  mod_double_noise
) |>
  mutate(
    mod = factor(c("Full data with c_x", 
                   "Full data with c_y", 
                   "Full data with noise",
                   "Complete case with c_x",
                   "Complete case wtih c_y",
                   "Complete case with noise",
                   "Imputation with c_x and y",
                   "Imputation with c_y and y",
                   "Imputation with noise and y",
                   "Two models with c_x",
                   "Two models with c_y",
                   "Two models with noise" ),
                 levels = c("Full data with c_x", 
                            "Complete case with c_x",
                            "Imputation with c_x and y",
                            "Two models with c_x",
                            "Full data with c_y", 
                            "Complete case wtih c_y",
                            "Imputation with c_y and y",
                            "Two models with c_y",
                            "Full data with noise",
                            "Complete case with noise",
                            "Imputation with noise and y",
                            "Two models with noise" )),
    mod = fct_rev(mod),
  ) -> to_plot

ggplot(to_plot, aes(x = estimate, xmin = conf.low, xmax = conf.high, y = mod)) +
  geom_pointrange() + 
  geom_vline(xintercept = 1, lty = 2)

Pretty good! Maybe I’m feeling a little better about imputation. Maybe. But it had better include the outcome (which I’ll admit feels very weird for my little causal brain).

We only have c_x

We only have c_y

What if both are wrong?

Addendum!

What if we use y in the imputation models

We only have `c_x`

We only have `c_y`

What if we use `y` in the imputation models